AWS Bedrock Knowledge Base

How Large Language Models Work

Overview

This guide explains the internal process of how a Large Language Model (LLM) transforms your text prompt into a coherent answer.

The Four-Stage Pipeline

1. Prompt Text → Tokenization

What happens:
- Your input text is broken down into smaller units called "tokens"
- Tokens can be words, subwords, or even individual characters
- Each token is converted to a numerical ID from the model's vocabulary

Example:

Input: "How do neural networks learn?"
Tokens: ["How", " do", " neural", " networks", " learn", "?"]
Token IDs: [2437, 466, 17019, 7686, 2193, 30]

Key concepts:
- Vocabulary size: typically 32k-100k+ tokens
- Special tokens: <start>, <end>, <pad> for structure
- Subword tokenization: handles rare words by breaking them into parts
- Token limit: models have maximum context windows (e.g., 4k, 8k, 128k tokens)
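
A minimal sketch of this step using the open-source tiktoken library (an assumption for illustration; Bedrock models use their own tokenizers, so the exact splits and IDs will differ):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")          # a common GPT-style vocabulary
text = "How do neural networks learn?"

token_ids = enc.encode(text)                        # text -> token IDs
tokens = [enc.decode([tid]) for tid in token_ids]   # inspect each token's text

print(tokens)      # e.g. ['How', ' do', ' neural', ' networks', ' learn', '?']
print(token_ids)   # model-specific integer IDs
```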


2. Token Processing → Embeddings

What happens:
- Each token ID is converted into a high-dimensional vector (embedding)
- Embeddings capture semantic meaning in numerical space
- Position encodings are added to preserve word order

Example:

Token "neural" → [0.23, -0.45, 0.67, ..., 0.12] (768 dimensions)
Token "network" → [0.19, -0.41, 0.71, ..., 0.09] (768 dimensions)

Key concepts:
- Embedding dimension: typically 768-12,288 dimensions
- Similar words have similar vector representations
- Position matters: "dog bites man" ≠ "man bites dog"
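
Illustrative only: a toy NumPy version of an embedding lookup with sinusoidal position encodings added. Real models learn the embedding matrix during training; the vocabulary size and 768 dimensions here are placeholder values.

```python
import numpy as np

vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))  # learned in practice

def embed(token_ids):
    vectors = embedding_matrix[token_ids]            # (seq_len, d_model)
    positions = np.arange(len(token_ids))[:, None]   # 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]
    angle = positions / (10000 ** (2 * (dims // 2) / d_model))
    pos_enc = np.where(dims % 2 == 0, np.sin(angle), np.cos(angle))
    return vectors + pos_enc                         # order-aware embeddings

print(embed([2437, 466, 17019]).shape)               # (3, 768)
```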


3. Reasoning → Transformer Layers

What happens:
- Embeddings flow through multiple transformer layers (12-96+ layers)
- Each layer performs self-attention and feed-forward operations
- The model identifies patterns, relationships, and context

Self-Attention Mechanism:

For each token:
1. Look at all other tokens in the context
2. Calculate relevance scores (attention weights)
3. Combine information from relevant tokens
4. Update the token's representation

Example attention pattern:

Input: "The cat sat on the mat because it was comfortable"
Token "it" attends strongly to → "mat" (or "cat")

Layer-by-layer processing:
- Early layers: syntax, grammar, basic patterns
- Middle layers: semantic relationships, entity recognition
- Late layers: abstract reasoning, task-specific logic

Key concepts:
- Multi-head attention: parallel attention mechanisms (8-96 heads)
- Residual connections: preserve information across layers
- Layer normalization: stabilize training and inference
- Feed-forward networks: non-linear transformations
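
A single-head, NumPy-only sketch of scaled dot-product attention following the four steps above; real models use many heads, learned projection matrices, and masking, all omitted here.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project each token
    scores = q @ k.T / np.sqrt(k.shape[-1])          # relevance of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax attention weights
    return weights @ v                               # combine information from relevant tokens

seq_len, d_model = 10, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))              # token embeddings
w = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
print(self_attention(x, *w).shape)                   # (10, 64): updated representations
```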


4. Text Answer Building → Decoding

What happens:
- The final layer outputs probability distributions over vocabulary
- The model selects the next token based on sampling strategy
- This process repeats autoregressively until completion

Decoding strategies:

Greedy decoding:

Always pick the highest probability token
→ Deterministic but sometimes repetitive

Temperature sampling:

temperature = 0.0  → deterministic (always most likely)
temperature = 0.7  → balanced creativity
temperature = 1.5  → very creative/random

Top-k sampling:

Consider only the k most likely tokens (e.g., k=40)
Sample from this restricted set

Top-p (nucleus) sampling:

Consider tokens until cumulative probability reaches p (e.g., p=0.9)
More dynamic than top-k
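
The strategies above can be illustrated on a single step's output scores (logits). This is a hedged sketch over a made-up four-token toy vocabulary; real decoders work over vocabularies of 32k+ tokens.

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random.default_rng()):
    if temperature == 0.0:                        # greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature                 # temperature scaling
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    if top_k is not None:                         # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                         # nucleus: smallest set with mass <= p
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True                            # always keep the single top token
        mask = np.zeros_like(probs, dtype=bool); mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])          # toy scores for a 4-token vocabulary
print(sample(logits, temperature=0.7, top_k=3))
```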

Example generation:

Prompt: "The capital of France is"
Step 1: Model outputs → " Paris" (95% probability)
Step 2: Model outputs → "," (60% probability)
Step 3: Model outputs → " which" (45% probability)
...continues until <end> token or max length

Complete Flow Example

Input Prompt:

"Explain photosynthesis in simple terms"

Step-by-step process:

1. Tokenization:

["Explain", " photo", "synthesis", " in", " simple", " terms"]
→ [8849, 5052, 48935, 287, 2829, 2846]

2. Embedding:

Each token → 768-dimensional vector
+ positional encoding (token 0, 1, 2, ...)

3. Transformer processing (simplified):

Layer 1:  Recognizes "Explain" is a request
Layer 5:  Understands "photosynthesis" is a biological process
Layer 10: Connects "simple terms" → need for accessible explanation
Layer 15: Activates knowledge about plants, sunlight, energy
Layer 20: Formulates explanation structure

4. Generation:

Token 1: "Photo" (start of answer)
Token 2: "synthesis"
Token 3: " is"
Token 4: " the"
Token 5: " process"
...
(continues until complete answer)
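
For a concrete end-to-end run, here is a hedged sketch using the Hugging Face transformers library and the small public gpt2 model (an assumption for illustration, not a Bedrock model); tokenization, the transformer forward passes, sampling, and detokenization all happen inside generate and decode.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain photosynthesis in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")        # 1. tokenization
output_ids = model.generate(                           # 2-4. embedding, transformer
    **inputs,                                          #      layers, and decoding
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # detokenization
```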

Key Parameters That Affect Output

Temperature (0.0 - 2.0): Controls randomness; lower values make output more deterministic, higher values more varied.

Top-p / Top-k: Restrict sampling to the most likely tokens (the smallest set with cumulative probability p, or the k highest-probability tokens).

Max tokens: Upper bound on how many tokens the model may generate in its response.

Frequency penalty: Lowers the probability of tokens in proportion to how often they have already appeared, discouraging repetition.

Presence penalty: Lowers the probability of any token that has already appeared at least once, encouraging new topics.
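
A hedged sketch of passing these parameters to a model through the Amazon Bedrock Converse API with boto3; the model ID is a placeholder, and frequency/presence penalties are provider-specific, so they are not shown in the common inferenceConfig fields.

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Explain photosynthesis simply."}]}],
    inferenceConfig={
        "temperature": 0.7,   # randomness of token sampling
        "topP": 0.9,          # nucleus sampling threshold
        "maxTokens": 512,     # cap on generated tokens
    },
)
print(response["output"]["message"]["content"][0]["text"])
```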


Model Architecture Components

Core elements:

  1. Token embeddings: Convert IDs to vectors
  2. Position embeddings: Encode sequence order
  3. Attention layers: Identify relationships between tokens
  4. Feed-forward layers: Transform representations
  5. Layer normalization: Stabilize activations
  6. Output projection: Convert to vocabulary probabilities
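
Illustrative only: the six components above wired together as a single PyTorch block with placeholder dimensions; real models stack dozens of such blocks.

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab=50_000, d=768, heads=12, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)        # 1. token embeddings
        self.pos_emb = nn.Embedding(max_len, d)      # 2. position embeddings
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)  # 3. attention
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))  # 4. feed-forward
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)      # 5. layer normalization
        self.out = nn.Linear(d, vocab)               # 6. output projection

    def forward(self, ids):
        pos = torch.arange(ids.shape[1], device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                        # residual connection + norm
        x = self.norm2(x + self.ff(x))
        return self.out(x)                           # logits over the vocabulary

logits = TinyTransformerLM()(torch.randint(0, 50_000, (1, 6)))
print(logits.shape)  # (1, 6, 50000)
```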

Model sizes:
- Small models: hundreds of millions to a few billion parameters
- Large models: tens to hundreds of billions of parameters
- Larger models generally capture more nuance but need more memory and compute


Training vs Inference

Training (how models learn):
- The model reads huge text corpora and repeatedly predicts the next token
- Prediction errors are used to adjust the weights (backpropagation)
- Happens once, on large compute clusters, before the model is deployed

Inference (how models respond):
- The trained weights are frozen; no learning takes place
- The model runs the pipeline above to generate tokens for each request
- This is what happens every time you send a prompt


Limitations and Considerations

Context window: The model can only attend to a fixed number of tokens at once; anything beyond that limit falls out of scope.

Knowledge cutoff: The model only knows what appeared in its training data and is unaware of events after that date.

Hallucinations: The model can produce fluent but incorrect statements because it predicts plausible text rather than verified facts.

Reasoning limitations: Multi-step logic, precise arithmetic, and long chains of deduction remain error-prone without external tools.


Optimization Techniques

Quantization: Store weights at lower precision (e.g., 8-bit or 4-bit instead of 16-bit) to reduce memory use and speed up inference with a small quality trade-off.

Caching: Reuse previously computed attention key/value tensors (the KV cache) so each new token does not reprocess the entire prompt.

Batching: Run several requests through the model at once to keep the hardware busy and increase throughput.


How AI Agents Use LLMs

An AI agent is a system that uses an LLM as its "brain" but extends it with additional capabilities like tool use, memory, and planning. Here's how agents work:

Basic Agent Architecture

User Request
    ↓
Agent System (orchestration layer)
    ↓
┌─────────────────────────────────────┐
│  LLM (reasoning engine)             │
│  - Understands request              │
│  - Plans actions                    │
│  - Decides what tools to use        │
└─────────────────────────────────────┘
    ↓
Tool Execution (external actions)
    ↓
Results fed back to LLM
    ↓
Final Response to User

The Agent Loop (ReAct Pattern)

Agents typically follow a Thought → Action → Observation cycle:

Example: "What's the weather in Paris and convert the temperature to Celsius?"

Iteration 1:
  Thought: "I need to get the current weather in Paris"
  Action: call_tool("get_weather", {"city": "Paris"})
  Observation: "Temperature: 72°F, Sunny"

Iteration 2:
  Thought: "I have the temperature in Fahrenheit, need to convert to Celsius"
  Action: call_tool("convert_temperature", {"value": 72, "from": "F", "to": "C"})
  Observation: "22.2°C"

Iteration 3:
  Thought: "I have all the information needed"
  Action: respond_to_user
  Response: "The weather in Paris is sunny with a temperature of 22.2°C (72°F)."

Key Components of an Agent

1. System Prompt (Instructions)

You are an AI assistant with access to tools.
When you need information, use the available tools.
Always explain your reasoning before taking action.

Available tools:
- search_web(query): Search the internet
- read_file(path): Read a file
- execute_code(code): Run Python code

2. Tool Definitions

{
  "name": "search_web",
  "description": "Search the internet for current information",
  "parameters": {
    "query": "string - the search query"
  }
}

3. Conversation Memory

[Previous messages]
User: "Find the population of Tokyo"
Assistant: [used search_web] "Tokyo has 14 million people"
User: "What about Paris?"
Assistant: [remembers context] [uses search_web] "Paris has 2.1 million people"

How Agents Extend LLM Capabilities

| Limitation | How Agents Solve It |
| --- | --- |
| No real-time data | Connect to APIs, databases, search engines |
| Can't perform actions | Execute code, modify files, send emails |
| Limited memory | Store conversation history, use vector databases |
| No access to private data | Read from user's files, databases, documents |
| Can't verify facts | Use tools to check information, run calculations |

Agent Execution Flow

Step 1: Prompt Construction

System Instructions
+
Tool Definitions
+
Conversation History
+
User Request
→ Sent to LLM

Step 2: LLM Response Parsing

LLM Output: "I need to search for information. 
             <tool_call>search_web("Paris weather")</tool_call>"

Agent parses this and extracts:
- Tool name: search_web
- Parameters: {"query": "Paris weather"}

Step 3: Tool Execution

Agent executes: search_web("Paris weather")
Result: "Current weather in Paris: 22°C, Sunny"

Step 4: Result Injection

Agent adds result to context:
"Tool result: Current weather in Paris: 22°C, Sunny"
→ Sends back to LLM for next decision

Step 5: Iteration or Completion

LLM decides:
- Need more tools? → Repeat cycle
- Have enough info? → Generate final response

Types of Agent Architectures

1. ReAct (Reasoning + Acting)
- LLM reasons about what to do
- Executes actions via tools
- Observes results and continues

2. Plan-and-Execute
- LLM creates a complete plan first
- Agent executes all steps
- Less flexible but more predictable

3. Autonomous Agents
- Given high-level goals
- Continuously run until goal achieved
- Can spawn sub-tasks

4. Multi-Agent Systems
- Multiple specialized agents
- Each has different tools/expertise
- Collaborate to solve complex tasks

Tool Calling Formats

Function Calling (Structured)

{
  "tool": "get_weather",
  "arguments": {
    "city": "Paris",
    "units": "celsius"
  }
}

Natural Language (Parsed)

I'll use the weather tool to check Paris.
ACTION: get_weather(city="Paris", units="celsius")

XML Format

<tool_call>
  <name>get_weather</name>
  <parameters>
    <city>Paris</city>
    <units>celsius</units>
  </parameters>
</tool_call>

Agent Memory Systems

Short-term Memory:
- Current conversation context
- Recent tool results
- Stored in prompt/context window

Long-term Memory:
- Vector database for semantic search
- Key-value stores for facts
- Retrieved when relevant

Example:

User: "Remember my favorite color is blue"
→ Agent stores: {"user_preference": "favorite_color", "value": "blue"}

Later...
User: "What color should I paint my room?"
→ Agent retrieves: "favorite_color = blue"
→ Response: "Since your favorite color is blue, you might consider..."

Error Handling and Retries

Agents handle failures that LLMs alone cannot:

Attempt 1: call_tool("search", {"query": ""})
Error: "Query cannot be empty"

Agent injects error into context:
"Tool error: Query cannot be empty. Please provide a valid query."

LLM adjusts:
Attempt 2: call_tool("search", {"query": "Paris weather"})
Success: Returns weather data
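
A minimal sketch of that feedback loop; call_llm and search are hypothetical stand-ins, and the error type is assumed for illustration.

```python
def call_with_retry(call_llm, search, history, max_attempts=3):
    for _ in range(max_attempts):
        args = call_llm(history)     # LLM proposes tool arguments, e.g. {"query": "..."}
        try:
            return search(**args)    # success: return the tool result
        except ValueError as err:    # e.g. "Query cannot be empty"
            # Inject the error into the context so the LLM can adjust its next attempt
            history.append({"role": "tool", "content": f"Tool error: {err}. Please retry."})
    raise RuntimeError("Tool call failed after retries")
```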

Agent vs Pure LLM

| Feature | Pure LLM | AI Agent |
| --- | --- | --- |
| Knowledge | Training data only | Can access real-time data |
| Actions | Generate text only | Execute code, API calls, file operations |
| Memory | Context window only | Persistent storage, retrieval |
| Accuracy | May hallucinate | Can verify with tools |
| Autonomy | Single response | Multi-step task completion |
| Cost | One API call | Multiple API calls (LLM + tools) |

Real-World Agent Example

Task: "Analyze the sales data from last month and create a report"

Step 1: LLM plans
  Thought: "I need to read the sales data file"
  Action: read_file("sales_2024_12.csv")

Step 2: LLM analyzes
  Observation: [CSV data received]
  Thought: "I should calculate total sales and trends"
  Action: execute_code("import pandas as pd; df = pd.read_csv(...)")

Step 3: LLM generates insights
  Observation: [Analysis results]
  Thought: "Now I'll create a formatted report"
  Action: write_file("sales_report.md", content)

Step 4: LLM confirms
  Observation: [File created successfully]
  Response: "I've analyzed the sales data and created a report..."

Best Practices for Agent Design

1. Clear tool descriptions
- LLM needs to understand when to use each tool
- Include examples in tool documentation

2. Limit tool complexity
- Simple, focused tools work better
- Break complex operations into smaller tools

3. Provide feedback loops
- Always return tool results to the LLM
- Let LLM verify and adjust

4. Set iteration limits
- Prevent infinite loops
- Typical limit: 5-10 iterations

5. Use structured outputs
- JSON or XML for tool calls
- Easier to parse reliably

Agent Limitations

Cost:
- Multiple LLM calls per task
- Can be expensive for complex workflows

Latency:
- Each tool call adds delay
- Multi-step tasks take longer

Reliability:
- More complex = more failure points
- LLM might choose wrong tools

Unpredictability:
- Agent behavior can vary
- Same task might use different approaches


Summary

The LLM pipeline:

Text Prompt
    ↓
Tokenization (text → token IDs)
    ↓
Embedding (IDs → vectors)
    ↓
Transformer Layers (reasoning & pattern matching)
    ↓
Output Projection (vectors → probabilities)
    ↓
Decoding (probabilities → tokens)
    ↓
Detokenization (tokens → text)
    ↓
Generated Answer

Each stage up to the output probabilities is deterministic given the same inputs and parameters; the decoding stage can then introduce controlled randomness through sampling to create diverse, natural responses.

Agents extend this pipeline by wrapping the LLM in an orchestration layer that enables tool use, memory, and multi-step reasoning, transforming a text generator into an autonomous problem-solver.